Objectives
- Load microarray dataset into R
- Explore the dataset with basic visualizations
- Identify differentially expressed genes (DEGs)
- Generate annotation of the DEGs (Tentative)
The Central Dogma of Biology
Cleft Lip and Palate 1/3
Cleft lip and cleft palate (CLP) are splits in the upper lip, the roof of the mouth (palate) or both. They result when facial structures that are developing in an unborn baby do not close completely. CLP is one of the most common birth defects with a frequency of 1/700 live births.
Cleft Lip and Palate 2/3
Children with cleft lip with or without cleft palate face a variety of challenges, depending on the type and severity of the cleft.
Difficulty feeding. One of the most immediate concerns after birth is feeding.
Ear infections and hearing loss. Babies with cleft palate are especially at risk of developing middle ear fluid and hearing loss.
Dental problems. If the cleft extends through the upper gum, tooth development may be affected.
Speech difficulties. Because the palate is used in forming sounds, the development of normal speech can be affected by a cleft palate. Speech may sound too nasal.
Reference: Mayo Foundation for Medical Education and Research
Cleft Lip and Palate 3/3
DNA variation in Interferon Regulatory Factor 6 (IRF6) causes Van der Woude syndrome (VWS)
VWS is the most common syndromic form of cleft lip and palate.
However, the causal variant in IRF6 has been found in only 70% of VWS families!
IRF6 is a transcription factor with a conserved helix-turn-helix DNA binding domain and a less well-conserved protein binding domain.
Reference: Hum Mol Genet. 2014 May 15; 23(10): 2711–2720
Question
Given:
The pathogenic variant in IRF6 exists in only 70% of the VWS families
IRF6 is a transcription factor
How can we identify other genes that might be involved in the remaining 30% of the VWS families?
Hint
Usually, genes that are regulated by a transcription factor belong to the same biological process or pathway.
Therefore, by comparing the gene expression patterns between wild-type (functional) Irf6 and knockout (non-functional) Irf6, it could be possible to identify genes that are regulated (targeted) by Irf6.
Hypothesis
\(H_0 : \mu_{WT} = \mu_{KO}\)
\(H_A : \mu_{WT} \ne \mu_{KO}\)
Where \(\mu\) is the mean of the gene expression values of a gene.
One-sided or Two-sided testing?

Why Microarray?

Why Microarray?
No need for candidate genes (or genes of interest)
One experiment assesses the entire transcriptome
One experiment generates many hypotheses
Only a small amount of RNA is required (~15–200 ng)

Experimental Design
- 3 IRF6 wild-type (+/+) and 3 knockout (-/-) mouse embryos.
- E17.5 embryos were removed from euthanized mothers.
- Skin was removed from embryos.
- Total RNA was isolated from the skin.
- Resultant RNA was hybridized to Affymetrix GeneChip Mouse Genome 430 2.0 arrays.

Dataset
- The original dataset can be obtained from NCBI GEO with accession GSE5800
Loading
First, we are going to load the dataset from the .tsv file into R as a variable called data using the read.table function.
data is just an arbitrary variable name to hold the result of read.table and it can be called/named almost anything.
# Load the data from a file into a variable
data = read.table("https://raw.githubusercontent.com/ahmedmoustafa/AUCBIOT5206/master/microarray/datasets/irf6.tsv", header = TRUE, row.names = 1)
# Convert the data.frame (table) into a numeric matrix
data = as.matrix(data)
Note: the hash sign (#) indicates that what comes after is a comment. Comments are for documentation and readability of the R code and they are not evaluated (or executed).
Checking
dim(data) # Dimension of the dataset
[1] 45101 6
head(data) # First few rows
KO1 KO2 KO3 WT1 WT2 WT3
1415670_at 6531.0 5562.8 6822.4 7732.1 7191.2 7551.9
1415671_at 11486.3 10542.7 10641.4 10408.2 9484.5 7650.2
1415672_at 14339.2 13526.1 14444.7 12936.6 13841.7 13285.7
1415673_at 3156.8 2219.5 3264.4 2374.2 2201.8 2525.3
1415674_a_at 4002.0 3306.9 3777.0 3760.6 3137.0 2911.5
1415675_at 3468.4 3347.4 3332.9 3073.5 3046.0 2914.4
Number of Genes and IDs
number_of_genes = nrow(data) # number of genes = number of rows
number_of_genes
[1] 45101
ids = row.names(data) # The ids of the genes are the names of the rows
head(ids)
[1] "1415670_at" "1415671_at" "1415672_at" "1415673_at" "1415674_a_at" "1415675_at"
Exploring
Check the behavior of the data (e.g., normal?, skewed?)
hist(data, col = "gray", main="Histogram")
Transforming
\(log_2\) transformation (why?)
data2 = log2(data)
hist(data2, col = "gray")

Boxplot
colors = c(rep("navy", 3), rep("orange", 3))
boxplot(data2, col = colors)

Clustering 1/2
Hierarchical clustering of the samples (i.e., columns) based on the correlation coefficients of the expression values
hc = hclust(as.dist(1 - cor(data2)))
plot(hc)

Clustering 2/2
To learn more about a function (e.g., hclust), you may type ?function (e.g., ?hclust) in the console to launch R documentation on that function:
Splitting Data Matrix into Two 1/2
ko = data2[, 1:3] # KO matrix
head(ko)
KO1 KO2 KO3
1415670_at 12.67309 12.44160 12.73606
1415671_at 13.48763 13.36396 13.37740
1415672_at 13.80768 13.72346 13.81825
1415673_at 11.62425 11.11602 11.67260
1415674_a_at 11.96651 11.69126 11.88303
1415675_at 11.76005 11.70883 11.70256
Splitting Data Matrix into Two 2/2
wt = data2[, 4:6] # WT matrix
head(wt)
WT1 WT2 WT3
1415670_at 12.91664 12.81202 12.88262
1415671_at 13.34543 13.21136 12.90128
1415672_at 13.65917 13.75673 13.69759
1415673_at 11.21323 11.10447 11.30224
1415674_a_at 11.87675 11.61517 11.50755
1415675_at 11.58567 11.57270 11.50898
Gene (Row) Mean Expression
# Compute the means of the KO samples
ko.means = rowMeans(ko)
head(ko.means)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
12.61692 13.40966 13.78313 11.47096 11.84693 11.72381
# Compute the means of the WT samples
wt.means = rowMeans(wt)
head(wt.means)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
12.87043 13.15269 13.70450 11.20664 11.66649 11.55578
Scatter 1/2
plot(ko.means ~ wt.means) # The actual scatter plot
abline(0, 1, col = "red") # Only a diagonal line

Scatter 2/2
pairs(data2) # All pairwise comparisons

Differentially Expressed Genes (DEGs)
To identify DEGs, we will identify:
- Genes that are biologically significantly differentially expressed (by fold-change)
- Genes that are statistically significantly differentially expressed (by p-value)
Then, we will take the overlap (intersection) of the two sets

Biological Significance (fold-change) 1/2
fold = ko.means - wt.means # Difference between log2 means = log2 fold-change (KO relative to WT)
head(fold)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
-0.25351267 0.25697097 0.07863227 0.26431191 0.18044345 0.16803065
Considering the WT condition as the reference (or control), what do the positive and negative values of the fold-change indicate?
+ve fold-change \(\rightarrow\) Up-regulation \(\uparrow\)
-ve fold-change \(\rightarrow\) Down-regulation \(\downarrow\)
Biological Significance (fold-change) 2/2
hist(fold, col = "gray") # Histogram of the fold

Statistical Significance (p-value) 1/3
- To assess the statistical significance of the difference in the expression values for each gene between the two conditions (e.g., WT and KO), we are going to use the t-test.

t-test
Let’s say there are two samples, x and y, from two populations, X and Y, respectively. To determine whether the means of the two populations are significantly different, we can use t.test.
?t.test
t-test : Example 1
x = c(4, 3, 10, 7, 9) ; y = c(7, 4, 3, 8, 10)
t.test(x, y)
Welch Two Sample t-test
data: x and y
t = 0.1066, df = 7.9743, p-value = 0.9177
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.12888 4.52888
sample estimates:
mean of x mean of y
6.6 6.4
t.test(x, y)$p.value
[1] 0.917739
t-test : Example 2
x = c(6, 8, 10, 7, 9) ; y = c(3, 2, 1, 4, 5)
t.test(x, y)
Welch Two Sample t-test
data: x and y
t = 5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.693996 7.306004
sample estimates:
mean of x mean of y
8 3
t.test(x, y)$p.value
[1] 0.001052826
Statistical Significance (p-value) 2/3
Let’s compute the p-value for all genes using a for-loop of t.test, one gene at a time:
pvalue = numeric(number_of_genes) # Pre-allocated vector for the p-values
for(i in 1 : number_of_genes) { # For each gene, from 1 to the number of genes
x = wt[i, ] # WT values of gene number i
y = ko[i, ] # KO values of gene number i
result = t.test(x, y) # t-test between the two conditions
pvalue[i] = result$p.value # Store p-value number i in the vector of p-values
}
head(pvalue)
[1] 0.092706280 0.182663337 0.129779075 0.272899180 0.262377176 0.005947807
Statistical Significance (p-value) 3/3
hist(-log10(pvalue), col = "gray") # Histogram of p-values (-log10)

Volcano : Statistical & Biological 1/3
plot(-log10(pvalue) ~ fold)

Volcano : Statistical & Biological 2/3
fold_cutoff = 2 # On the log2 scale: |log2 fold-change| >= 2, i.e., at least 4-fold
pvalue_cutoff = 0.01
plot(-log10(pvalue) ~ fold)
abline(v = fold_cutoff, col = "blue", lwd = 3)
abline(v = -fold_cutoff, col = "red", lwd = 3)
abline(h = -log10(pvalue_cutoff), col = "green", lwd = 3)
Volcano : Statistical & Biological 3/3

Filtering for DEGs 1/3
filter_by_fold = abs(fold) >= fold_cutoff # Biological
sum(filter_by_fold) # Number of genes satisfying the condition
[1] 1051
filter_by_pvalue = pvalue <= pvalue_cutoff # Statistical
sum(filter_by_pvalue)
[1] 1564
filter_combined = filter_by_fold & filter_by_pvalue # Combined
sum(filter_combined)
[1] 276
Filtering for DEGs 2/3
filtered = data2[filter_combined, ]
dim(filtered)
[1] 276 6
head(filtered)
KO1 KO2 KO3 WT1 WT2 WT3
1416200_at 13.312004 12.973357 12.868456 7.40429 8.558803 8.683696
1416236_a_at 14.148397 14.039236 14.130007 12.23604 12.022402 11.495056
1417808_at 5.321928 5.442943 4.053111 15.16978 15.070087 14.753274
1417932_at 10.602884 10.257152 10.496055 13.98445 14.203294 13.720960
1418050_at 10.622052 10.975490 10.795066 12.86513 13.012048 12.658122
1418100_at 9.117903 8.634811 9.057721 12.90358 12.842449 12.233769
Filtering for DEGs 3/3
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_combined]) ~ fold[filter_combined],
col = "green")

Exercise
On the volcano plot, highlight the up-regulated genes in red and the down-regulated genes in blue
Solution 1/2
# Screen for the up-regulated genes (+ve fold)
filter_up = filter_combined & fold > 0
head(filter_up)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
FALSE FALSE FALSE FALSE FALSE FALSE
# Number of filtered genes
sum(filter_up)
[1] 95
# Screen for the down-regulated genes (-ve fold)
filter_down = filter_combined & fold < 0
head(filter_down)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
FALSE FALSE FALSE FALSE FALSE FALSE
# Number of filtered genes
sum(filter_down)
[1] 181
Solution 2/2
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_up]) ~ fold[filter_up], col = "red")
points(-log10(pvalue[filter_down]) ~ fold[filter_down], col = "blue")

Heatmap 1/5
heatmap(filtered)

Heatmap 2/5
By default, heatmap clusters genes (rows) and samples (columns) based on the Euclidean distance.
In the context of gene expression, we need to cluster genes and samples based on the correlation to explore patterns of co-regulation (co-expression) - Guilt by Association.
To make heatmap cluster by correlation instead, we cluster the genes and the samples ourselves using correlation coefficients (computed with cor) and pass the resulting dendrograms to heatmap.
Heatmap 3/5
# Clustering of the columns (samples)
col_dendrogram = as.dendrogram(hclust(as.dist(1-cor(filtered))))
# Clustering of the rows (genes)
row_dendrogram = as.dendrogram(hclust(as.dist(1-cor(t(filtered)))))

Heatmap 4/5
# Heatmap with the rows and columns clustered by correlation coefficients
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram)

Heatmap 5/5
library(gplots) # Load the gplots library
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram, col = rev(redgreen(1024)))

Annotation 1/3
To obtain the functional annotation of the differentially expressed genes, we first extract their probe ids:
filtered_ids = row.names(filtered) # ids of the filtered DE genes
length(filtered_ids)
[1] 276
head(filtered_ids)
[1] "1416200_at" "1416236_a_at" "1417808_at" "1417932_at" "1418050_at" "1418100_at"
Sanity Check (Irf6)

Multiple Testing Correction 1/3
We conducted 45101 statistical tests. The computed p-values need to be corrected for multiple testing. The correction can be performed using p.adjust, which takes the original p-values as a vector and returns the adjusted (corrected) p-values:
adjusted.pvalues = p.adjust(pvalue, method = "fdr")
Number of original p-values \(\leq\) 0.05 = 5099, while the number of adjusted (corrected) p-values \(\leq\) 0.05 = 9
Multiple Testing Correction 2/3
Here is an example of the original p-values and corresponding adjusted p-values:
Multiple Testing Correction 3/3

Homework
- Identify the top 10 biologically significant genes (i.e., by fold-change)
- Identify the top 10 statistically significant genes (i.e., by p-value)
---
title: 'Microarray Gene Expression Analysis with R'
output: 
  html_notebook: 
    toc: yes
subtitle: 'Example: Interferon Regulatory Factor 6 (*IRF6*)'
license: by-sa
---

```{r libraries, echo=FALSE, message=FALSE, warning=FALSE}
library(readr)
library(printr)
library(dplyr)
library(ggplot2)
library(cowplot)
```

## Objectives
- Load microarray dataset into R
- Explore the dataset with basic visualizations
- Identify differentially expressed genes (DEGs)
- Generate annotation of the DEGs (*Tentative*)

<center>
![](images/title.png "Microarray Analysis with R")
</center>

## The Central Dogma of Biology
![DNA makes RNA and RNA makes protein](images/dogma.png "The Central Dogma of Biology")


## Cleft Lip and Palate 1/3

Cleft lip and cleft palate (**CLP**) are splits in the upper lip, the roof of the mouth (palate) or both. They result when facial structures that are developing in an unborn baby do not close completely. CLP is one of the most common birth defects with a frequency of 1/700 live births.

![Cleft lip and palate](images/cleft.jpg)

## Cleft Lip and Palate 2/3

Children with cleft lip with or without cleft palate face a variety of challenges, depending on the type and severity of the cleft.

- **Difficulty feeding.** One of the most immediate concerns after birth is feeding.

- **Ear infections and hearing loss.** Babies with cleft palate are especially at risk of developing middle ear fluid and hearing loss.

- **Dental problems.** If the cleft extends through the upper gum, tooth development may be affected.

- **Speech difficulties.** Because the palate is used in forming sounds, the development of normal speech can be affected by a cleft palate. Speech may sound too nasal.

*Reference*: [Mayo Foundation for Medical Education and Research](https://www.mayoclinic.org/diseases-conditions/cleft-palate/symptoms-causes/syc-20370985)

## Cleft Lip and Palate 3/3

- DNA variation in Interferon Regulatory Factor 6 (**IRF6**) causes Van der Woude syndrome (**VWS**)

- VWS is the most common syndromic form of cleft lip and palate.

- However, the causal variant in IRF6 has been found in *only* 70% of VWS families!

- IRF6 is a **transcription factor** with a conserved helix-turn-helix DNA binding domain and a less well-conserved protein binding domain.

*Reference*: [Hum Mol Genet. 2014 May 15; 23(10): 2711–2720](http://doi.org/10.1093/hmg/ddt664)

## Question

Given:

1. The pathogenic variant in IRF6 exists in only 70% of the VWS families

2. IRF6 is a transcription factor

How can we identify other genes that might be involved in the remaining 30% of the VWS families?

## Hint

- Usually, genes that are regulated by a transcription factor belong to the same biological process or pathway.

- Therefore, by comparing the gene expression patterns between wild-type (functional) *Irf6* and knockout (non-functional) *Irf6*, it could be possible to identify genes that are regulated (targeted) by *Irf6*.

## Hypothesis

- \(H_0 : \mu_{WT} = \mu_{KO}\)

- \(H_A : \mu_{WT} \ne \mu_{KO}\)

- Where \(\mu\) is the *mean* of the gene expression values of a gene.

- **One**-sided or **Two**-sided testing?

```{r sides, echo=FALSE, message=FALSE, fig.height=2}
n = 1e6
cutoff = qnorm(0.05)
mydata = data.frame(x = rnorm(n))
left = mydata %>% filter(x < cutoff)
right = mydata %>% filter(x > -cutoff)
both = mydata %>% filter(x < cutoff | x > -cutoff)

binwidth = 1e-1

p = ggplot()
p = p + geom_histogram(data = mydata, aes(x = x), fill = "gray", binwidth = binwidth)
p = p + geom_histogram(data = left, aes(x = x), fill = "red", binwidth = binwidth)
p = p + labs(x = "", y = "")
p = p + theme_light()
p = p + theme(axis.text.x = element_blank())
p = p + theme(axis.text.y = element_blank())
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p1 = p

p = ggplot()
p = p + geom_histogram(data = mydata, aes(x = x), fill = "gray", binwidth = binwidth)
p = p + geom_histogram(data = right, aes(x = x), fill = "blue", binwidth = binwidth)
p = p + labs(x = "", y = "")
p = p + theme_light()
p = p + theme(axis.text.x = element_blank())
p = p + theme(axis.text.y = element_blank())
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p2 = p

p = ggplot()
p = p + geom_histogram(data = mydata, aes(x = x), fill = "gray", binwidth = binwidth)
p = p + geom_histogram(data = left, aes(x = x), fill = "red", binwidth = binwidth)
p = p + geom_histogram(data = right, aes(x = x), fill = "blue", binwidth = binwidth)
p = p + labs(x = "", y = "")
p = p + theme_light()
p = p + theme(axis.text.x = element_blank())
p = p + theme(axis.text.y = element_blank())
p = p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p3 = p

plot_grid(p1, p2, p3, labels = c("a", "b", "c"), ncol = 3)
```

## Why Microarray?

![](images/one-does-not-simply.jpg)

## Why Microarray?

- No need for candidate genes (or genes of interest)

- One experiment assesses the entire transcriptome

- One experiment generates many hypotheses

- Only a small amount of RNA is required (~15–200 ng)

![](images/chip.jpg)


## Original Paper

![PMID: 17041601](images/pmid17041601.png)

## Experimental Design

- 3 IRF6 wild-type (+/+) and 3 knockout (-/-) mouse embryos.
- E17.5 embryos were removed from euthanized mothers.
- Skin was removed from embryos.
- Total RNA was isolated from the skin.
- Resultant RNA was hybridized to Affymetrix GeneChip Mouse Genome 430 2.0 arrays.

![](images/mice.png)

## Dataset

- The original dataset can be obtained from NCBI GEO with accession [GSE5800](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5800)

```{r dataset, echo=FALSE, message=FALSE, warning=FALSE}
df = read_tsv("data/Irf6.tsv")[1:4,]
head(df)
```


## Loading
First, we are going to load the dataset from the `.tsv` file into `R` as a variable called `data` using the [`read.table`](http://www.inside-r.org/r-doc/utils/read.table) function.
<br>
`data` is just an arbitrary **variable** name to hold the result of `read.table` and it can be called/named *almost* anything.

```{r}
# Load the data from a file into a variable
data = read.table("https://raw.githubusercontent.com/ahmedmoustafa/AUCBIOT5206/master/microarray/datasets/irf6.tsv", header = TRUE, row.names = 1)

# Convert the data.frame (table) into a numeric matrix
data = as.matrix(data)
```

**Note:** the hash sign (`#`) indicates that what comes after is a *comment*. Comments are for documentation and readability of the `R` code and they are not evaluated (or executed).
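
To see what `header = TRUE` and `row.names = 1` do, here is a minimal sketch on a tiny made-up table passed through the `text` argument of `read.table` instead of a file. The names `toy_text`, `toy`, `toy_matrix`, `probe_a`, and `probe_b`, as well as the values, are hypothetical and are not part of the Irf6 dataset.

```r
# A tiny made-up, tab-separated table (hypothetical values, not from the dataset)
toy_text = "ID_REF\tKO1\tWT1\nprobe_a\t10.5\t20.1\nprobe_b\t7.2\t7.9"

# header = TRUE: the first line holds the column names
# row.names = 1: the first column holds the row names (here, the probe ids)
toy = read.table(text = toy_text, header = TRUE, row.names = 1)

# Convert the data.frame into a numeric matrix
toy_matrix = as.matrix(toy)

dim(toy_matrix)       # 2 rows (probes) and 2 columns (samples)
rownames(toy_matrix)  # the probe ids
```

Because the ids become row names rather than a data column, the remaining columns are all numeric and `as.matrix` yields a numeric matrix.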

## Checking

```{r}
dim(data) # Dimension of the dataset
head(data) # First few rows
```

## Number of Genes and IDs
```{r}
number_of_genes = nrow(data) # number of genes = number of rows
number_of_genes

ids = row.names(data) # The ids of the genes are the names of the rows
head(ids)
```

## Exploring
Check the behavior of the data (e.g., normal?, skewed?)

```{r}
hist(data, col = "gray", main="Histogram")
```

## Transforming

\(log_2\) transformation (why?)

```{r}
data2 = log2(data)
hist(data2, col = "gray")
```
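
One reason for the \(log_2\) transformation can be sketched with made-up intensities (not taken from the dataset): on the raw scale, a doubling and a halving lie at different distances from the reference, while on the \(log_2\) scale they become +1 and -1.

```r
# Made-up intensities for illustration only
reference = 1000
doubled = 2000   # 2-fold higher than the reference
halved = 500     # 2-fold lower than the reference

# Raw scale: the two changes are asymmetric around the reference
raw_up = doubled - reference     # 1000
raw_down = halved - reference    # -500

# log2 scale: the two changes are symmetric around zero
log_up = log2(doubled) - log2(reference)     # 1
log_down = log2(halved) - log2(reference)    # -1
```

The transformation also compresses the long right tail of the raw intensities, which is what the second histogram above is meant to show.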

## Boxplot

```{r}
colors = c(rep("navy", 3), rep("orange", 3))
boxplot(data2, col = colors)
```

## Clustering 1/2

Hierarchical clustering of the **samples** (i.e., columns) based on the [correlation coefficients](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) of the expression values

```{r}
hc = hclust(as.dist(1 - cor(data2)))
plot(hc)
```
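
The expression `1 - cor(...)` turns a correlation into a distance: highly correlated samples end up close to 0. A minimal sketch on simulated data (the sample names `A1`, `A2`, `B1`, `B2` and all values are made up for illustration):

```r
set.seed(1)  # reproducible toy data

# Two made-up expression profiles across 50 hypothetical genes
profile_a = rnorm(50)
profile_b = rnorm(50)

# Four toy samples: A1 and A2 follow profile_a, B1 and B2 follow profile_b
toy = cbind(
  A1 = profile_a + rnorm(50, sd = 0.1),
  A2 = profile_a + rnorm(50, sd = 0.1),
  B1 = profile_b + rnorm(50, sd = 0.1),
  B2 = profile_b + rnorm(50, sd = 0.1)
)

# Correlation-based distance between samples (columns)
toy_dist = as.dist(1 - cor(toy))
toy_hc = hclust(toy_dist)

# Cut the tree into two groups of samples
groups = cutree(toy_hc, k = 2)
groups
```

In this toy case the two `A` samples fall into one group and the two `B` samples into the other, which is the pattern one would hope to see for the KO and WT replicates.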

## Clustering 2/2
To learn more about a function (e.g., `hclust`), you may type `?function` (e.g., `?hclust`) in the `console` to launch `R` documentation on that function:

## Splitting Data Matrix into Two 1/2
```{r}
ko = data2[, 1:3] # KO matrix
head(ko)
```

## Splitting Data Matrix into Two 2/2
```{r}
wt = data2[, 4:6] # WT matrix
head(wt)
```


## Gene (Row) Mean Expression

```{r}
# Compute the means of the KO samples
ko.means = rowMeans(ko)
head(ko.means)

# Compute the means of the WT samples
wt.means = rowMeans(wt)
head(wt.means)

```


## Scatter 1/2
```{r}
plot(ko.means ~ wt.means) # The actual scatter plot
abline(0, 1, col = "red") # Only a diagonal line
```

## Scatter 2/2
```{r}
pairs(data2) # All pairwise comparisons
```

## Differentially Expressed Genes (DEGs)

To identify DEGs, we will identify:

- Genes that are **biologically** significantly differentially expressed (by fold-change)
- Genes that are **statistically** significantly differentially expressed (by *p*-value)

Then, we will take the **overlap** (**intersection**) of the two sets

![](images/intersection.png)

## Biological Significance (fold-change) 1/2
```{r}
fold = ko.means - wt.means # Difference between log2 means = log2 fold-change (KO relative to WT)
head(fold)
```

- Considering the `WT` condition as the **reference** (or **control**), what do the positive and negative values of the fold-change indicate?

- **+ve** fold-change \(\rightarrow\) **Up**-regulation \(\uparrow\)
- **-ve** fold-change \(\rightarrow\) **Down**-regulation \(\downarrow\)
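
Because `fold` is computed from \(log_2\)-transformed values, it is a \(log_2\) fold-change (a log ratio) rather than a plain ratio. A minimal sketch of converting back to a linear ratio, using made-up values (`log2_fold` and `linear_fold` are names introduced here for illustration):

```r
# Made-up log2 fold-changes (KO relative to WT)
log2_fold = c(-2, -1, 0, 1, 2)

# Back-transform to linear ratios (KO / WT)
linear_fold = 2^log2_fold
linear_fold  # 0.25 0.50 1.00 2.00 4.00
```

So a \(log_2\) fold-change of 2 corresponds to a 4-fold increase, and -2 to a 4-fold decrease.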

## Biological Significance (fold-change) 2/2
```{r}
hist(fold, col = "gray") # Histogram of the fold
```

## Statistical Significance (*p*-value) 1/3
- To assess the statistical significance of the difference in the expression values for each gene between the two conditions (e.g., `WT` and `KO`), we are going to use the [*t*-test](http://en.wikipedia.org/wiki/Student%27s_t-test).

```{r echo=FALSE,message=FALSE,warning=FALSE,fig.height=3}
n = 1e6
x = data.frame(Value = rnorm(n, m = 0.1, sd = 1), Condition = "X")
y = data.frame(Value = rnorm(n, m = -0.1, sd = 1), Condition = "Y")
z = x %>% bind_rows(y)
p = ggplot(z)
p = p + geom_density(aes(x = Value, fill = Condition), alpha = 0.5)
p = p + theme_light()
p = p + theme(legend.position = "top")
p1 = p

x = data.frame(Value = rnorm(n, m = 2, sd = 1), Condition = "X")
y = data.frame(Value = rnorm(n, m = -2, sd = 1), Condition = "Y")
z = x %>% bind_rows(y)
p = ggplot(z)
p = p + geom_density(aes(x = Value, fill = Condition), alpha = 0.5)
p = p + theme_light()
p = p + theme(legend.position = "top")
p2 = p

plot_grid(p1, p2, labels = c("a", "b"))

```

## *t*-test

Let's say there are two samples, *x* and *y*, from two populations, *X* and *Y*, respectively. To determine whether the means of the two populations are significantly different, we can use `t.test`.

```{r}
?t.test
```

## *t*-test : Example 1

```{r}
x = c(4, 3, 10, 7, 9) ; y = c(7, 4, 3, 8, 10)
t.test(x, y)
```

```{r}
t.test(x, y)$p.value
```

## *t*-test : Example 2

```{r}
x = c(6, 8, 10, 7, 9) ; y = c(3, 2, 1, 4, 5)
t.test(x, y)
```

```{r}
t.test(x, y)$p.value
```
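
`t.test` performs a two-sided test by default. As a sketch of how a one-sided test differs, the `alternative` argument can be set to `"greater"` (or `"less"`). With the vectors from Example 2, where the mean of `x` is larger than the mean of `y`, the one-sided *p*-value is half of the two-sided one. The names `two_sided` and `one_sided` are introduced here for illustration.

```r
x = c(6, 8, 10, 7, 9) ; y = c(3, 2, 1, 4, 5)

# Two-sided test (default): H_A is mean(x) != mean(y)
two_sided = t.test(x, y)$p.value

# One-sided test: H_A is mean(x) > mean(y)
one_sided = t.test(x, y, alternative = "greater")$p.value

c(two_sided = two_sided, one_sided = one_sided)
```

For the knockout experiment, a gene could go either up or down, which is why the two-sided default matches the hypothesis \(\mu_{WT} \ne \mu_{KO}\).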

## Statistical Significance (*p*-value) 2/3

Let's compute the *p*-value for all genes using a `for`-loop of `t.test`, one gene at a time:

```{r}
pvalue = numeric(number_of_genes) # Pre-allocated vector for the p-values

for(i in 1 : number_of_genes) { # For each gene, from 1 to the number of genes
  x = wt[i, ] # WT values of gene number i
  y = ko[i, ] # KO values of gene number i
  result = t.test(x, y) # t-test between the two conditions
  pvalue[i] = result$p.value # Store p-value number i in the vector of p-values
}
head(pvalue)
```
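
The same computation can also be written without an explicit loop. The sketch below uses small random matrices (`toy_wt` and `toy_ko` are made up for illustration, not the real data) and checks that `sapply` returns the same *p*-values as a `for`-loop.

```r
set.seed(42)  # reproducible toy data

# Made-up expression values: 5 hypothetical genes x 3 replicates per condition
toy_wt = matrix(rnorm(15, mean = 10), nrow = 5)
toy_ko = matrix(rnorm(15, mean = 11), nrow = 5)
n_genes = nrow(toy_wt)

# Version 1: explicit for-loop, one gene (row) at a time
loop_p = numeric(n_genes)
for (i in seq_len(n_genes)) {
  loop_p[i] = t.test(toy_wt[i, ], toy_ko[i, ])$p.value
}

# Version 2: sapply applies the same function to every row index
sapply_p = sapply(seq_len(n_genes), function(i) {
  t.test(toy_wt[i, ], toy_ko[i, ])$p.value
})

identical(loop_p, sapply_p)  # both versions run the same tests
```

Either form is fine for teaching purposes; the loop makes each step visible, while `sapply` is more compact.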

## Statistical Significance (*p*-value) 3/3
```{r}
hist(-log10(pvalue), col = "gray") # Histogram of p-values (-log10)
```

## Volcano : Statistical & Biological 1/3
```{r}
plot(-log10(pvalue) ~ fold)
```

## Volcano : Statistical & Biological 2/3
```{r eval=FALSE}
fold_cutoff = 2 # On the log2 scale: |log2 fold-change| >= 2, i.e., at least 4-fold
pvalue_cutoff = 0.01

plot(-log10(pvalue) ~ fold)

abline(v = fold_cutoff, col = "blue", lwd = 3)
abline(v = -fold_cutoff, col = "red", lwd = 3)
abline(h = -log10(pvalue_cutoff), col = "green", lwd = 3)
```

## Volcano : Statistical & Biological 3/3
```{r echo=FALSE}
fold_cutoff = 2
pvalue_cutoff = 0.01

plot(-log10(pvalue) ~ fold)

abline(v = fold_cutoff, col = "blue", lwd = 3)
abline(v = -fold_cutoff, col = "red", lwd = 3)
abline(h = -log10(pvalue_cutoff), col = "green", lwd = 3)
```


## Filtering for DEGs 1/3
```{r}
filter_by_fold = abs(fold) >= fold_cutoff # Biological
sum(filter_by_fold) # Number of genes satisfying the condition

filter_by_pvalue = pvalue <= pvalue_cutoff # Statistical
sum(filter_by_pvalue)

filter_combined = filter_by_fold & filter_by_pvalue # Combined
sum(filter_combined)
```
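
The filters above are logical vectors, and `&` keeps only the genes for which both conditions are `TRUE`. A minimal sketch with five made-up genes (`g1` to `g5` and their values are hypothetical):

```r
# Made-up log2 fold-changes and p-values for five hypothetical genes
toy_fold = c(g1 = 2.5, g2 = -3.0, g3 = 0.4, g4 = 2.2, g5 = -0.1)
toy_p = c(g1 = 0.001, g2 = 0.2, g3 = 0.0005, g4 = 0.009, g5 = 0.8)

by_fold = abs(toy_fold) >= 2   # Biological: large change in either direction
by_p = toy_p <= 0.01           # Statistical: small p-value
combined = by_fold & by_p      # Both conditions at once

sum(by_fold)             # 3 genes pass the fold-change filter
sum(by_p)                # 3 genes pass the p-value filter
names(which(combined))   # "g1" "g4" pass both
```

Note that `g2` has a large fold-change but a poor *p*-value, and `g3` the opposite, so neither survives the combined filter.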

## Filtering for DEGs 2/3
```{r}
filtered = data2[filter_combined, ]
dim(filtered)
head(filtered)
```

## Filtering for DEGs 3/3
```{r}
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_combined]) ~ fold[filter_combined],
       col = "green")
```

## Exercise
On the volcano plot, highlight the up-regulated genes in red and the down-regulated genes in blue

## Solution 1/2

- Up-regulated genes
```{r}
# Screen for the up-regulated genes (+ve fold)
filter_up = filter_combined & fold > 0

head(filter_up)
```

```{r}
# Number of filtered genes
sum(filter_up)
```

- Down-regulated genes
```{r}
# Screen for the down-regulated genes (-ve fold)
filter_down = filter_combined & fold < 0

head(filter_down)
```

```{r}
# Number of filtered genes
sum(filter_down)
```

## Solution 2/2
```{r}
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_up]) ~ fold[filter_up], col = "red")
points(-log10(pvalue[filter_down]) ~ fold[filter_down], col = "blue")
```

## Heatmap 1/5
```{r}
heatmap(filtered)
```

## Heatmap 2/5
- By default, `heatmap` clusters genes (rows) and samples (columns) based on [the Euclidean distance](http://en.wikipedia.org/wiki/Euclidean_distance).

- In the context of gene expression, we need to cluster genes and samples based on the correlation to explore patterns of **[co-regulation](http://dx.doi.org/10.1186/1471-2105-5-18)** (**co-expression**) - *Guilt by Association*.

- To make `heatmap` cluster by correlation instead, we cluster the genes and the samples ourselves using correlation coefficients (computed with `cor`) and pass the resulting dendrograms to `heatmap`.

## Heatmap 3/5

```{r}
# Clustering of the columns (samples)
col_dendrogram = as.dendrogram(hclust(as.dist(1-cor(filtered))))

# Clustering of the rows (genes)
row_dendrogram = as.dendrogram(hclust(as.dist(1-cor(t(filtered)))))
```

![](images/guilt_by_association.jpg)

## Heatmap 4/5
```{r}
# Heatmap with the rows and columns clustered by correlation coefficients
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram)
```

## Heatmap 5/5
```{r eval=FALSE}
library(gplots) # Load the gplots library
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram, col = rev(redgreen(1024)))
```

```{r echo=FALSE, message=FALSE}
library(gplots) # Load the gplots library
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram, col = rev(redgreen(1024)))
```

## Annotation 1/3
To obtain the functional annotation of the differentially expressed genes, we first extract their probe ids:
```{r}
filtered_ids = row.names(filtered) # ids of the filtered DE genes
length(filtered_ids)
head(filtered_ids)
```

## Sanity Check (Irf6)

![Down Regulation of Irf6](images/irf6_down.png)

```{r echo=FALSE,fig.align="center", fig.width=4, fig.height=3.5}
irf6_id = "1418301_at"
irf6 = filtered[which(rownames(filtered) == irf6_id), ]
boxplot(irf6[1:3], irf6[4:6], col = c("navy", "orange"), names = c("KO", "WT"), main = "")
```

## Multiple Testing Correction 1/3
We conducted `r number_of_genes` statistical tests. The computed *p*-values need to be corrected for *multiple testing*. The correction can be performed using `p.adjust`, which takes the original *p*-values as a vector and returns the adjusted (corrected) *p*-values:

```{r}
adjusted.pvalues = p.adjust(pvalue, method = "fdr")
```

Number of **original** *p*-values $\leq$ 0.05 = `r sum(pvalue <= 0.05)`, while the number of **adjusted** (**corrected**) *p*-values $\leq$ 0.05 = `r sum(adjusted.pvalues <= 0.05)`
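
In `p.adjust`, `method = "fdr"` is an alias for the Benjamini-Hochberg (`"BH"`) procedure. As a rough sketch of what the adjustment does, the code below recomputes it by hand on five made-up *p*-values and compares the result with `p.adjust`. The names `p`, `manual`, and `builtin` are introduced here for illustration.

```r
# Made-up p-values for five hypothetical tests
p = c(0.001, 0.008, 0.039, 0.041, 0.6)
n = length(p)

# Benjamini-Hochberg by hand:
# walk from the largest to the smallest p-value, scale each by n / rank,
# and take the running minimum so that a smaller p-value never gets
# a larger adjusted value
o = order(p, decreasing = TRUE)
manual = numeric(n)
manual[o] = pmin(1, cummin(p[o] * n / (n:1)))

builtin = p.adjust(p, method = "fdr")

rbind(p, manual, builtin)
```

Every adjusted value is at least as large as the original *p*-value, which is why far fewer genes pass the same 0.05 threshold after correction.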

## Multiple Testing Correction 2/3
Here is an example of the original *p*-values and corresponding adjusted *p*-values:

```{r echo=FALSE}
df = tibble(pvalue = pvalue, adjusted.pvalue = adjusted.pvalues)
head(df)
```

## Multiple Testing Correction 3/3
```{r echo=FALSE}
p = ggplot(df)
p = p + geom_histogram(aes(x = -log10(pvalue)), bins = 1000, alpha = 0.5, fill = "red")
p = p + geom_histogram(aes(x = -log10(adjusted.pvalue)), bins = 1000, alpha = 0.5, fill = "blue")
print(p)
```

## Homework
- Identify the top 10 *biologically* significant genes (i.e., by fold-change)
- Identify the top 10 *statistically* significant genes (i.e., by *p*-value)

<center>
![](images/joke.png)
</center>

